Master of Liberal Arts, Data Science

CSCI E-83 Fundamentals of Data Science in Python

Professor's name: Stephen Elston

Author's name: Dai Phuong Ngo (Liam)

Analyzing Housing Affordability: Exploring Key Influential Features by Regression and Statistical Modeling

Introduction and Background

The housing market has long been a crucial area of interest for businesses, individuals, and especially policy makers. In the mortgage banking industry, where I currently work, understanding the factors that drive housing affordability is critical for tailoring financial products, advising clients, and managing risk from different perspectives. In Canada in particular, housing affordability has become increasingly challenging due to rising sale prices, limited inventory caused by slow construction relative to high demand, and economic uncertainty. This has driven mortgage providers and underwriters, real estate brokers, and consumers to evaluate the most influential factors behind housing affordability.

The issue is complicated by the wide range of features affecting sale price, from direct ones such as living space, condition, and property type to less obvious ones such as year remodeled or basement size. An unwelcome fact is that the distribution of sale prices is often skewed, with extreme outliers complicating further analysis. Traditional evaluations may fail to capture all of these variations, so applying advanced statistical techniques is important for understanding and diagnosing affordability more thoroughly.

Project Goal

This project aims to analyze the key housing features that affect sale prices and housing affordability in order to provide actionable insights. Inferences drawn from the data can identify the features with the greatest impact on affordability and inform three audiences: homebuyers, who must make informed decisions about which housing option fits their budget; real estate brokers, who must tailor advice and recommendations to individual needs; and mortgage providers or banks, like my current company, which must design financing solutions around housing affordability metrics. The final goal is therefore to develop statistical solutions with sampling techniques, OLS regression models, and Bayesian analysis to assess how the most influential housing features contribute to housing affordability. The emphasis is on inference rather than prediction, and the analysis will address the challenges posed by skewed price distributions and their outliers so that conclusions are drawn in a statistically reliable, practical, and relevant way.

What This Project Is Not About

It is important to clarify that this project is not about price prediction or predictive optimization. Instead, it concentrates on statistical inference: determining the impact of different features on affordability and providing a statistical framework to interpret those impacts and identify the most influential drivers. Machine learning algorithms such as tree-based models or neural networks are out of scope, as the project focuses on capturing interpretable relationships among features. The project also does not set manual affordability thresholds or classify properties by affordability.

Data

A challenge arises with data availability for the Canadian context: I was unable to find a ready-to-consume dataset on Canadian housing affordability. I searched the websites of the Government of Canada and Statistics Canada, but their data are scattered, not yet combined, mostly time series, and would require significant time to assemble enough aspects for manipulation and combination, regardless of each individual dataset's completeness. I therefore decided to apply the statistical modelling techniques for analyzing housing affordability to data from the American housing context of Ames, Iowa.

The data comes directly from Kaggle in two sets: training and testing. I will use the training set for data exploration, preparation, assessment, and model development before applying the best model to the testing set. The data has 79 explanatory variables, a manageable number, describing aspects of residential properties in Ames, Iowa. It provides an excellent source for studying housing trends and affordability, with a good number of meaningful features and records covering different aspects of housing. The modelling will treat all features as candidate influencers to be analyzed, visualized, sampled, and modelled in order to arrive at a final set of best features across different samples of Ames house prices and affordability. Later, the training data can be split into two samples, training and testing, for further model evaluation.

https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques

Import Libraries

Load Data

Part 1: EXPLORATORY DATA ANALYSIS

Review data type & Cast MSSubClass

As MSSubClass was initially read as int64, I will cast it as object: although its values are numeric, they are codes for dwelling types, not quantities.
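A minimal sketch of the cast (the small frame here is a stand-in for the loaded training data):

```python
import pandas as pd

# MSSubClass codes identify dwelling types, so they should not be treated
# as quantities despite being numeric.
df = pd.DataFrame({"MSSubClass": [20, 60, 120]})  # stand-in for the real column
df["MSSubClass"] = df["MSSubClass"].astype("object")
print(df["MSSubClass"].dtype)  # object
```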

Check for null or NaN

As NaN values exist in some features, I will fill them with 0, the median, or specific values based on the Data Dictionary, depending on the feature. For some Year-related features, I will leave the NaNs as they are for now.
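A sketch of these fill rules; the column names follow the Ames data dictionary, and the specific fill choices shown are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "LotFrontage": [65.0, None, 80.0],   # numeric -> median
    "GarageArea": [548.0, None, 0.0],    # no garage -> 0
    "PoolQC": [None, "Gd", None],        # dictionary: NaN means "no pool"
})
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
df["GarageArea"] = df["GarageArea"].fillna(0)
df["PoolQC"] = df["PoolQC"].fillna("None")
print(df.isna().sum().sum())  # 0
```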

Create Total Living Area feature

Create Total Bath feature

By creating these two new features, relationships in the data can be captured more effectively through aggregates with combined effects. Total living area explains house prices better than splitting the area across individual floors, and total bath count incorporates the effects of full and half bathrooms in a more meaningful way. This matches expectations: home buyers tend to care about total living space and the total number of bathrooms, counting both full and half baths. The two new features simplify the model while preserving explanatory power. They also remove the correlation between the original features and stabilize the later models, since high correlation between independent variables leads to multicollinearity, which inflates the standard errors of coefficients and makes them unstable and difficult to interpret.
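A sketch of the two aggregate features using Ames column names; weighting half baths at 0.5 is a common convention and an assumption here, not something stated above:

```python
import pandas as pd

df = pd.DataFrame({
    "1stFlrSF": [856, 1262], "2ndFlrSF": [854, 0], "TotalBsmtSF": [856, 1262],
    "FullBath": [2, 2], "HalfBath": [1, 0],
    "BsmtFullBath": [1, 0], "BsmtHalfBath": [0, 1],
})
# Combine floor and basement areas into one size measure.
df["TotalLivingArea"] = df["1stFlrSF"] + df["2ndFlrSF"] + df["TotalBsmtSF"]
# Count full baths fully and half baths at half weight.
df["TotalBath"] = (df["FullBath"] + df["BsmtFullBath"]
                   + 0.5 * (df["HalfBath"] + df["BsmtHalfBath"]))
print(df[["TotalLivingArea", "TotalBath"]])
```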

Recode Building Type and House Style

Histogram before Conversion

Address & Remove Zero Inflated Features

Explanation for removing the columns

As per the Histogram plots and domain knowledge:

Due to low explanatory power or redundancy:

Due to high dimensional features with excessive levels:

Due to minimal relationship with target:

Due to zero-inflated effects:

Add new calculated features

Creating new features like Sale Price by Total Living Area can provide meaningful insight into the relative cost and value of properties. There are several reasons why such features are important for further analysis, which I describe below:

First, it normalizes sale price for better comparison across properties of varying sizes, since sale prices can vary greatly with living area. Using absolute prices directly can lead to misleading comparisons. For example, a small house with a high sale price might seem expensive, yet it may have a lower cost per square foot than a larger house with a slightly higher sale price.

Secondly, it captures affordability in a more meaningful way. Sale price per square foot reflects affordability more realistically than total sale price. Buyers often use cost per square foot as their own benchmark for affordability and value. When comparing two houses at the same price, their different sizes can produce very different perceptions of affordability.

Thirdly, these calculations help handle skewness and outliers in the sale price distribution by reducing skewness and mitigating the influence of outliers. If a very large house with an unusually high sale price appears in the listings, it can skew the results. Normalizing with the new calculations makes such a property more interpretable and statistically stable.

Fourthly, it can improve models like OLS regression at explaining variation in sale price by incorporating both price and area into a single metric, which yields stronger predictors and reduces collinearity.

Furthermore, it aligns with real estate domain practice for evaluating property value and market trends, and it makes results more interpretable for buyers, sellers, and mortgage providers.

In practice, the code ensures the denominator features are non-zero before dividing. Even after generating the new metrics, they can still be skewed; later, a log transformation can normalize them further, and a scaling step can standardize them to enhance model performance before their significance is confirmed in the correlation matrices.
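The guarded division can be sketched as follows, dividing only where the denominator is positive and leaving NaN otherwise (column names follow the notebook's conventions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"SalePrice": [200000, 150000],
                   "TotalLivingArea": [2500, 0]})
# Divide only where the denominator is positive; NaN marks undefined ratios.
df["Price_Per_TotalLivingArea"] = np.where(
    df["TotalLivingArea"] > 0,
    df["SalePrice"] / df["TotalLivingArea"],
    np.nan)
print(df["Price_Per_TotalLivingArea"].tolist())  # [80.0, nan]
```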

Select numerical features excluding ID, Year-type columns

Any datetime-type features will be excluded from further EDA, as they are better suited to time series analysis.

Histogram before Transformation

QQ Plot before Transformation

Before any transformations, many features such as SalePrice, LotArea, LotFrontage, and TotalLivingArea exhibit heavy skewness, especially to the right. Features like Price_Per_LotArea and Price_Per_TotalLivingArea show non-normal distributions with long tails. Features such as OpenPorchSF, EnclosedPorch, and Fireplaces contain a significant proportion of zero values, which contributes to skewness, non-normality, and a lack of linearity. The QQ plots confirm this: many features deviate substantially from the diagonal line, with extreme tail departures, suggesting that linearity assumptions are not met.
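A quick numeric screen matching these visual observations: a sample skewness well above zero confirms the right skew seen in the histograms (the values here are made-up stand-ins):

```python
import pandas as pd

# One extreme lot dominates, producing strong right skew.
df = pd.DataFrame({"LotArea": [5000, 7000, 8000, 9000, 10000, 90000]})
print(df["LotArea"].skew())  # strongly positive
```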

Transformations
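A sketch of the transformation step, assuming `np.log1p` so that the zero-inflated columns (e.g. 2ndFlrSF) map 0 to 0 instead of producing negative infinity:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"GrLivArea": [1710, 2198], "2ndFlrSF": [854, 0]})
for col in ["GrLivArea", "2ndFlrSF"]:
    # log1p(x) = log(1 + x): defined at zero, compresses the right tail.
    df[f"Log_{col}"] = np.log1p(df[col])
print(df["Log_2ndFlrSF"].tolist())  # second entry stays 0.0
```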

Histogram and QQ after Transformation

After the log transformation, features like Log_LotArea, Log_Price_Per_1stFlrSF, Log_Price_Per_TotalLivingArea, and Log_GrLivArea appear much more symmetrical and closer to a bell-shaped distribution. Meanwhile, a feature like Log_2ndFlrSF still shows a non-normal distribution and residual asymmetry because it is heavily zero-inflated. This makes sense, as many properties have no garage, basement, or second floor, so I will retain these features as they are.

Outliers are also mitigated and compressed, ensuring they will not heavily affect the analysis. The data spread becomes more even, which is better for variability, and normality becomes more apparent; in particular, the log-transformed sale price provides a more balanced target variable for the later regression models. The log transformation therefore improves feature scaling and limits heteroscedasticity, which will improve model performance and interpretation in the next phase. Features whose skewness dropped after the log transformation, such as Log_1stFlrSF, Log_LotFrontage, and Log_LotArea, are retained, as they could contribute significantly to the affordability modeling.

It is noteworthy that this improved normality means the regression assumptions, including linearity and normality of residuals, are better satisfied. Nevertheless, residual zero-inflation and outliers remain in some features and could still skew the regression results, so they should be treated with caution.

Scaling

At this step, I will apply Z-scaling to transform the numerical features to a mean of 0 and a standard deviation of 1, ensuring all features contribute comparably to the model and preventing features with larger magnitudes or units from dominating.

In terms of consistency across features, models like linear regression are sensitive to the scale of numerical inputs. Scaling helps gradient-descent-based solvers converge faster; without it, optimization struggles to find the global minimum because of uneven gradients across dimensions. After scaling, coefficients in linear models reflect each feature's importance in standardized units, making feature importance easier to compare. Scaling also ensures that penalties on coefficients are applied uniformly for models like Ridge, Lasso, and ElasticNet in the next phase. For OLS and other interpretable linear models, scaling is not strictly required, as the coefficients adjust to the units; however, since cross-model comparison will be conducted, scaling the target in the same way as the inputs may also be needed.
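Z-scaling can be sketched in plain NumPy (subtract each column's mean, divide by its standard deviation):

```python
import numpy as np

# Two stand-in columns with very different scales (sq ft vs bath count).
X = np.array([[1500.0, 2.0], [2500.0, 3.0], [3500.0, 4.0]])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```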

Correlation Matrix after Transformation

Looking at this new correlation matrix, several features are strongly correlated with one another, suggesting multicollinearity. The bolded features below will therefore be dropped to reduce model complexity and increase interpretability.

There is very low correlation between HouseStyle and BldgType, so multicollinearity is not a concern and both features will be retained for further analysis.

Explanation for weak or no correlation

Log_TotalLivingArea with Log_Price_Per_TotalLivingArea

This feature reflects total living area, a fundamental determinant of price that is also the denominator of the new metric, so its low correlation with price per square foot makes sense.

Log_LotArea and Log_LotFrontage with Log_Price_Per_TotalLivingArea

The low positive correlation suggests that larger lots affect price per square foot only mildly. This may reflect economies of scale, where additional lot area has only a marginal impact on price per square foot.

Explanation for moderate correlation

GarageCars with Log_Price_Per_TotalLivingArea

Garage capacity shows a moderately positive correlation with price per square foot. While larger garages are desirable, they do not contribute as strongly to pricing as living space; capacity measured in the number of cars a garage holds is the more meaningful and impactful garage metric for sale price.

TotalBath with Log_Price_Per_TotalLivingArea

This feature shows a moderately positive correlation with the target. Almost all housing properties have at least one full or half bath, and buyers care about the number of bathrooms, so bathrooms influence sale price moderately.

Log_Price_Per_1stFlrSF with Log_Price_Per_TotalLivingArea

Price per square foot for the first floor strongly correlates with overall price per living area. This feature directly influences the target because larger or higher quality first floors generally increase the price. The stronger Pearson correlation suggests that the relationship is more linear in this case.

Log_Price_Per_LotArea with Log_Price_Per_TotalLivingArea

This feature indicates value per land area, which is particularly relevant in high-demand neighborhoods. The correlation makes sense, as larger lots tend to come with larger living areas.

Pearson correlation measures linear relationships between features; however, it assumes normally distributed data and is sensitive to outliers, so it is best suited to the numerical features I have transformed to approximate normality.

Meanwhile, Spearman correlation measures monotonic relationships, both linear and non-linear, and does not assume normality or linearity. It is therefore more robust against outliers and better suited to ordinal or skewed numerical data.

Since my numerical features were log-transformed to approximate normality, some of their relationships with the target may not be linear, and the data mixes categorical and numerical types, Spearman correlation provides the more comprehensive perspective.

At the moment, as per the Spearman correlation, Log_2ndFlrSF and GarageCars have the highest positive correlations with the target, both at moderate levels. This is partially borne out in the real market: having a second floor or more garage capacity does tend to affect price per square foot. However, it does not mean these two should be the sole strongest features; based on domain knowledge, other factors such as other spaces or the number of rooms should carry more weight, and properties do not always have a second floor or garage, so their presence affects the target only moderately. Moreover, because the size of the second floor contributes directly to the target while suffering heavily from zero-inflation, this feature should be removed for model simplification. Further assessment is needed to evaluate this hypothesis.

In the next phase, for features with low correlation, I will assess their importance during feature selection in regression models.

Statistical Summary for Building Type and House Style

These two categorical features can help explain the variability in house prices, as they are strongly correlated with the target. To achieve our inference goal, there are hypotheses to test, such as single-family houses having higher average prices than townhouses. Later, linear regression can be used to assess their impact on SalePrice.

Bootstrap sampling can then be used to estimate confidence intervals for the mean and median prices of each category, such as 1Fam vs Twnhs.
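A bootstrap sketch for a 95% confidence interval on one category's mean price; the prices below are made-up stand-ins, not values from the Ames data:

```python
import numpy as np

rng = np.random.default_rng(42)
prices = rng.normal(180000, 40000, size=200)  # stand-in 1Fam sale prices

# Resample with replacement and record the mean of each resample.
boot_means = [rng.choice(prices, size=prices.size, replace=True).mean()
              for _ in range(2000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for mean price: ({lo:.0f}, {hi:.0f})")
```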

Bayesian modeling can incorporate prior knowledge about the expected impact of these two features by using prior distributions built from the observed means and standard deviations of each group, such as 1Fam or 1Story. Bayesian inference can then update estimates as new data arrives and quantify the uncertainty in price estimates for a given category.

Violin plot after Log Transformation

The log transformation compresses the range of outliers for these features, bringing them closer to the main distribution. It also compresses the scale, making the interquartile range more pronounced and showcasing the central tendency of the data. As a result, the distributions are more symmetrical, with more balanced whiskers, medians closer to the center of the IQR, and more uniform variability. The log transformation thus improves scale consistency while tackling skewness, outliers, and model suitability.

Exclude some categorical features (without explanatory power for the target) TBA

Explanation for retaining the selected columns

As per the Violin plots:

In summary, categorical features like MSZoning, Neighborhood, and HouseStyle strongly influence price per living area. Neighborhood has the largest impact with significant variability across different categories. Features like Street and BldgType highlight distinctions between premium and lower-valued properties.

VIF for further Feature Selection

I will use VIF analysis to reduce the number of predictors, since encoding can greatly inflate the number of columns.

As per the VIF analysis, Log_TotalLivingArea, MSSubClass, and Log_LotArea show moderate multicollinearity (VIF > 3). BedroomAbvGr, KitchenAbvGr, and GarageCars are well-behaved predictors (VIF < 2) and can remain in the model. Since all of these features show VIF < 10, which is generally acceptable and not considered high multicollinearity, I will retain them for later analysis. In the next phase, I will run further models and regularizations to check their relationships with the target and decide whether excluding any of them is necessary.

Feature Encoding for categorical features TBA

Feature encoding is necessary because categorical features must be converted into numerical representations, for example with one-hot encoding.
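A one-hot encoding sketch with pandas; `drop_first=True` avoids the dummy-variable trap that would otherwise introduce perfect multicollinearity:

```python
import pandas as pd

df = pd.DataFrame({"BldgType": ["1Fam", "Twnhs", "Duplex", "1Fam"]})
# The first (alphabetical) level, 1Fam, becomes the implicit baseline.
encoded = pd.get_dummies(df, columns=["BldgType"], drop_first=True)
print(list(encoded.columns))  # ['BldgType_Duplex', 'BldgType_Twnhs']
```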

Lmplot Log Transformation

Strong Predictors:

GarageCars, TotalBath and potentially Log_LotArea have meaningful positive relationships with the target variable.

Negative Predictors:

BedroomAbvGr, KitchenAbvGr and Log_TotalLivingArea have negative relationships, suggesting diminishing returns or inefficiencies.

Weak Predictors:

MSSubClass and Log_1stFlrSF show weak linear trends and scatter.

In the next steps, I will evaluate the size-related features (Log_1stFlrSF, Log_TotalLivingArea, Log_LotArea) to decide whether any should be removed to simplify the model. Further investigation of non-linear effects for features like Log_TotalLivingArea or BedroomAbvGr may also be necessary. Based on the lmplots, I can retain strong predictors like GarageCars and TotalBath. If time allows, I will test interaction terms for the weak predictors using OLS regression after optimizing the modelling, to find the most optimal model and the most significant features affecting the target.

Save data

Methods of Modelling

My primary modeling approach will leverage Ordinary Least Squares (OLS) regression and ElasticNet regularization to analyze the relationships between the housing features and sale price. This will ultimately determine the most significant factors for housing affordability and sale price while addressing multicollinearity and overfitting.

Feature Selection and Interactions

All features in my dataset will go through the same correlation-matrix evaluation before and after log transformation, as completed above, to identify significant predictors. Features with low correlation or low predictive power on the target have already been identified and will be excluded when fitting the models in the next phase. Features skewed by a dominant single value, as seen in the earlier statistical plots, may also be considered for exclusion. Outliers will be removed iteratively to avoid overfitting and to preserve individual features' influence without bias. I am also considering interactions, for example between significant predictors such as Building Type and Above Ground Living Area; such terms can be explored to better understand nuanced relationships and thereby improve the tested models' ability to reflect housing dynamics.

Model Selection

I will employ both OLS regression and ElasticNet regularization for feature selection. OLS regression provides crucial metrics, such as coefficient p-values, the F-statistic, and adjusted R^2, to evaluate the significance of individual predictors and the overall fit. Insignificant features will then be excluded iteratively to minimize model computation and maximize interpretability. ElasticNet regularization will balance feature selection and multicollinearity with a combination of Lasso (L1) and Ridge (L2) penalties. In the next phase, I plan to add a tuning step for the regularization parameters using cross-validation to choose the optimal set of features.

Diagnostics

After the first model fit, I will use an influence plot to identify data points with high leverage and influence that affect the model disproportionately. Such points can be outliers that warrant removal before refitting. QQ plots and residual-versus-predicted plots will then be used again to assess the assumptions of normality, homoscedasticity, and linearity.

Bootstrapping

I will apply bootstrap resampling to calculate confidence intervals for the coefficients, which reduces the influence of variability in the data. This method helps ensure the stability and robustness of the feature coefficients and their effects on the target.
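A coefficient-bootstrap sketch for a single slope: resample rows with replacement, refit, and take percentile bounds of the resampled slopes (synthetic data with a true slope of 2.0):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

slopes = []
for _ in range(1000):
    idx = rng.integers(0, x.size, size=x.size)   # resample row indices
    slopes.append(np.polyfit(x[idx], y[idx], 1)[0])
lo, hi = np.percentile(slopes, [2.5, 97.5])
print(f"slope 95% CI: ({lo:.2f}, {hi:.2f})")  # brackets the true slope 2.0
```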

Bayesian Analysis

After completing the regression steps above, I will apply Bayesian models to estimate the posterior distributions of the crucial predictors. The probabilistic insights will let me quantify each feature's influence and incorporate prior domain knowledge. For features like Neighborhood or HouseStyle, Bayesian models can estimate group-level effects while incorporating uncertainty at each level.

I will specify hierarchical priors to reflect the nested structure of features like HouseStyle or Neighborhood. Markov Chain Monte Carlo (MCMC) or Hamiltonian Monte Carlo (HMC) will help estimate the posterior distributions. I will then analyze the posteriors to capture the effect size and uncertainty of each predictor, as well as group-specific effects such as price differences by neighborhood. Finally, this approach provides credible intervals to evaluate the range of plausible parameter values.
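The Bayesian update itself can be sketched in closed form for a conjugate normal model: a prior on a group's mean log price is combined with observed data (all numbers here are illustrative assumptions, not estimates from the Ames data, and the full model above would use MCMC rather than this closed form):

```python
import numpy as np

prior_mu, prior_sd = 12.0, 0.5   # prior belief about the group's mean log price
obs = np.array([11.8, 12.1, 12.3, 11.9, 12.0])  # observed log prices
obs_sd = 0.3                     # assumed known observation noise
n = obs.size

# Conjugate normal-normal update: precisions add, means are precision-weighted.
post_prec = 1 / prior_sd**2 + n / obs_sd**2
post_mu = (prior_mu / prior_sd**2 + obs.sum() / obs_sd**2) / post_prec
post_sd = post_prec ** -0.5
print(round(post_mu, 3), round(post_sd, 3))  # posterior tightens around the data
```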

Hierarchical Models

In the next phase, if time allows, I will implement hierarchical (multi-level) models for features such as HouseStyle, BldgType, and Neighborhood, given their contextual and hierarchical nature. Hierarchical models are particularly suitable for features with a nested or grouped structure, where data points within the same group are likely to share common characteristics.

Regarding Neighborhood, homes within the same neighborhood likely share similar pricing influences due to location-specific factors like school districts, accessibility, or crime rates. This dataset provides only neighborhood names without further information, but a hierarchical model will still account for neighborhood-level variability, such as some neighborhoods consistently having higher or lower prices.

Furthermore, different house styles, including those with high demand such as 1Story, 2Story influence design preferences and construction costs, impacting buyer perception and pricing. Therefore, by grouping homes based on HouseStyle, the model can account for style-specific effects.

Likewise, buildings classified as detached 1Fam or shared walls like Duplex may have distinct pricing structures. Hierarchical modeling helps isolate group-level effects for building type.

To do so, I will group the data by these features and use group-level intercepts and slopes to capture shared variance within each group. Random intercepts account for baseline differences across groups, and I will also try random slopes, if possible, to capture group-specific variations in relationships. These models can quantify between-group variability and assess within-group predictors: for example, variability in prices across neighborhoods, or how living area affects price differently within a neighborhood. This approach will bring insight into the relative importance of group-level effects on housing affordability.
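A varying-intercept sketch with statsmodels `MixedLM`: neighborhoods get their own baseline while the area slope is shared. The data is a synthetic stand-in with an assumed true slope of 0.8; neighborhood names are borrowed from Ames only as labels:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
groups = np.repeat(["NridgHt", "OldTown", "Sawyer"], 60)
offsets = {"NridgHt": 0.6, "OldTown": -0.4, "Sawyer": 0.0}  # group baselines
area = rng.normal(size=180)
price = (np.array([offsets[g] for g in groups])
         + 0.8 * area + rng.normal(scale=0.2, size=180))

df = pd.DataFrame({"log_price": price, "log_area": area, "hood": groups})
# Random intercept per neighborhood; fixed (shared) slope for log_area.
fit = smf.mixedlm("log_price ~ log_area", df, groups=df["hood"]).fit()
print(round(fit.params["log_area"], 2))  # close to the true shared slope 0.8
```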

Part 2: MODELLING

Expected Outcomes

The project will provide quantitative insights, practical applications, and validation. It will identify the most significant features affecting sale price and affordability, provide probabilistic insight into the impact of key features, recommend which features buyers should pay more attention to when buying within a budget, and validate the robustness of the statistical findings with confidence intervals to reduce uncertainty.

OLS Regression

1/ Initial OLS regression after log transformation

2/ OLS Model after removing influential points

Model 1 vs Model 2

| Metric/Criteria | Model 1 | Model 2 | Comparison and Comments |
| --- | --- | --- | --- |
| R-squared | 0.597 | 0.684 | Model 2 shows a significant improvement in model fit after removing influential points. |
| Adjusted R-squared | 0.586 | 0.675 | Adjusted R-squared confirms the improved performance, adjusting for model complexity. |
| RMS of Residuals | 0.635 | 0.507 | Residual error decreases, indicating better fit. |
| AIC | 2894 | 2085 | Model 2 has a lower AIC, showing improved model quality. |
| BIC | 3101 | 2283 | Significant drop in BIC; Model 2 is more parsimonious. |
| Condition Number | 3.77e+03 | 9.12e+15 | Condition number increases, suggesting potential multicollinearity after adjustments. |
| Influential Points (Count) | 101 | 51 | Removing influential points stabilizes Model 2 and reduces bias. |
| Coefficient Comparison | Mixed significance | Stable, reduced outliers | Model 2 shows fewer extreme coefficients due to removed influential points. |
| Skewness | -0.701 | -0.377 | Skewness reduces significantly, indicating better symmetry in residuals. |
| Kurtosis | 6.061 | 3.744 | Kurtosis decreases, suggesting a more normal residual distribution. |
| QQ Plot | Deviations from line | Aligned with line | QQ plot shows better normality for residuals in Model 2. |
| Histogram of Residuals | Skewed | Centered and normal | Residuals follow a closer-to-normal distribution in Model 2. |
| Residuals vs Predicted | Wide spread | Tighter spread | Residuals improve, showing reduced variance in predictions. |

In summary, Model 2 improves greatly by removing influential points while retaining all features, leading to better residual spread and fit. Key improvements are listed below:

Elastic Net regularization (naive approach with predefined parameters)

I will run OLS regression on the log-transformed features for initial insights. ElasticNet will then assist with feature selection for comparison against the results of the initial OLS model. A naive ElasticNet with predefined parameters will flag insignificant features as candidates for removal. The next OLS model, with those features removed, will be compared against the insights drawn from the original OLS model to provide further interpretation and diagnostics. Finally, I will validate my final model with residual analysis, bootstrapping, and Bayesian methods to confirm that the best model of each kind is reliable and effective.

Based on the ElasticNet, almost all categorical features are considered significant for the target, except most of the MSZoning levels and Street (Street_Pave). These two features will be removed in subsequent models to examine the remaining features' impact on the target without one or both of them.

3/ OLS with reduced features

Model 2 vs Model 3

| Metric/Criteria | Model 2 | Model 3 | Comparison and Comments |
| --- | --- | --- | --- |
| R-squared | 0.684 | 0.684 | No change in model fit; features remain consistent. |
| Adjusted R-squared | 0.675 | 0.675 | Performance remains identical after re-examination of influential points. |
| RMS of Residuals | 0.507 | 0.507 | No visible difference in residual error. |
| AIC | 2085 | 2085 | Model quality remains unchanged. |
| BIC | 2283 | 2283 | No difference in model selection criteria. |
| Influential Points (Count) | 51 | 60 | Slight increase in influential points due to minor feature refinements. |
| Condition Number | 9.12e+15 | 3.87e+03 | Condition number improves significantly, addressing multicollinearity issues. |
| Coefficient Comparison | Some instability | Stable and precise | Model 3 resolves coefficient instability by reducing multicollinearity. |
| Skewness | -0.377 | -0.393 | Skewness remains stable with slight deviation. |
| Kurtosis | 3.744 | 3.762 | Minimal change in kurtosis; residuals remain close to normal. |
| QQ Plot | Aligned with line | Aligned with line | Both models show similar QQ plots, indicating normality. |
| Histogram of Residuals | Normal | Normal | Residual distribution remains similar. |
| Residuals vs Predicted | Symmetric spread | Symmetric spread | Residuals are consistent between the two models. |

Dropping Street_Pave in Model 3 does not affect model performance. Some highlights of the improvements made by Model 3 are:

Therefore, the Street feature does not have a statistically significant impact on sale price per total living area compared with the other features. This matches reality: other features, such as the property's Neighborhood, can drive price per living area far more strongly. Those features affect buyers' affordability and choices more directly, as features within the house and the location of the house weigh more heavily in buyers' minds for various reasons: safety, comfort, lifestyle, and both basic and enhanced needs.

Variance Inflation Factor (VIF)

I will conduct a VIF test to find multicollinearity for possible cases among all features for the next model fit.
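As a concrete illustration of what the VIF measures, the sketch below computes it directly from its definition with NumPy (the notebook would typically use statsmodels' `variance_inflation_factor`); the synthetic features here are invented for demonstration only:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all remaining columns (with an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Two nearly collinear predictors inflate each other's VIF.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)   # almost a copy of x1
x3 = rng.normal(size=200)                    # independent predictor
X = np.column_stack([x1, x2, x3])
print(vif(X))  # first two entries are large, third is near 1
```

Features whose VIF is far above the usual rule-of-thumb thresholds (5 or 10) are candidates for removal, which is exactly what the results below suggest for MSZoning and Street_Pave.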

As per the VIF results, the high VIF values for the MSZoning variables indicate strong linear dependencies among these variables or with other predictors, making them redundant. Including them would inflate standard errors and produce unreliable coefficient estimates. Removing MSZoning improves model stability and interpretability without significantly sacrificing explanatory power, as its effects are likely captured by other predictors such as Neighborhood or TotalBath.

The VIF also shows a very high value for the Street_Pave feature, suggesting a strong linear dependency that makes it unnecessary; it should ultimately be removed. However, for the OLS model comparison, the 4th OLS model will retain it for research purposes, as the 3rd OLS model already removed it following the earlier naive ElasticNet.

Although some Neighborhood variables have slightly high VIF values, they are not excessively high, indicating moderate multicollinearity rather than severe. These variables provide unique and important location-specific information that cannot be replaced by other predictors. The hierarchical Bayesian model requires Neighborhood variables to estimate group-level effects. Neighborhood serves as the grouping factor for partially pooling information across neighborhoods, which is crucial for understanding variability at the neighborhood level. Eliminating Neighborhood would undermine the purpose of the hierarchical model, as it would remove the ability to study group-level variations and hierarchical relationships.

4/ OLS after VIF

Model 3 vs Model 4

| Metric/Criteria | Model 3 | Model 4 | Comparison and Comments |
|---|---|---|---|
| R-squared | 0.684 | 0.676 | Slight drop in R-squared due to feature reduction using VIF analysis. |
| Adjusted R-squared | 0.675 | 0.668 | Adjusted R-squared confirms minor reduction after multicollinearity correction. |
| RMS of Residuals | 0.507 | 0.513 | Residual error increases slightly but remains acceptable. |
| AIC | 2085 | 2112 | AIC increases slightly, indicating a trade-off between simplicity and fit. |
| BIC | 2283 | 2289 | Marginal increase in BIC due to reduced features. |
| Condition Number | 3.87e+03 | 3.48e+03 | Condition number improves, confirming reduced multicollinearity. |
| Influential Points (Count) | 60 | 60 | No change in influential points. |
| Coefficient Comparison | Stable | Slightly refined | Model 4 removes redundant features and stabilizes coefficients. |
| Skewness | -0.393 | -0.377 | Skewness reduces slightly, improving symmetry. |
| Kurtosis | 3.762 | 3.744 | Kurtosis remains stable, with minimal deviation. |
| QQ Plot | Aligned | Slight deviations | Minor deviations observed in Model 4. |
| Histogram of Residuals | Normal | Slight skew | Residuals slightly skewed in Model 4. |
| Residuals vs Predicted | Symmetric spread | Slight deviations | Minor residual spread observed in Model 4. |

Dropping MSZoning in Model 4 resolves multicollinearity at a minor cost to fit. Here are some key observations:

Sacrificing MSZoning still allows interpretation of the key drivers of sale price per total living area, since buyers tend to care more about other features than about how governments or policy makers zone areas for different construction purposes. This is as expected for the current scenario and the Ames, USA data.

ElasticNet Application with Fine-Tuned Hyperparameters and K-Fold Cross-Validation (sophisticated approach)

ElasticNet combines the strengths of both L1 (Lasso) and L2 (Ridge) regularization:

However, without careful hyperparameter tuning, I might underfit the model (if the regularization is too strong) or overfit it (if the regularization is too weak), and handle multicollinearity inefficiently, as experienced in Model 1 and Model 2. Irrelevant features are less of a concern here, since insignificant features were already pruned in the EDA phase.

In the next steps to improve ElasticNet performance, I will use ElasticNetCV for hyperparameter tuning to ensure the model generalizes well to unseen data.

Repeated K-Fold Cross-Validation is also needed to reduce overfitting risk and improve generalizability: the data is split into 10 folds, repeated 3 times, ensuring robust validation results. Furthermore, I will set a higher iteration limit (max_iter=10000) to ensure model convergence, particularly with small alphas or significant multicollinearity. This setup optimizes alpha and the L1 ratio without manually guessing hyperparameters, ensuring the best balance between bias and variance.
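A minimal sketch of this tuning setup, using a synthetic stand-in for the prepared housing design matrix (the notebook's actual `X`/`y` split and feature names are not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import RepeatedKFold

# Synthetic stand-in for the prepared housing design matrix.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)

# 10 folds repeated 3 times, as described above.
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=42)

model = ElasticNetCV(
    l1_ratio=[0.5, 0.9, 1.0],  # candidate L1/L2 mixes
    n_alphas=50,               # path of candidate regularization strengths
    cv=cv,
    max_iter=10000,            # higher limit helps convergence at small alphas
)
model.fit(X, y)
print("optimal alpha:", model.alpha_, "optimal l1_ratio:", model.l1_ratio_)
print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))
```

ElasticNetCV searches the whole alpha path per fold with warm starts, which is far cheaper than a naive grid search over separately fitted models.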

Split into X, y of train and test

Optimal Alpha

Select Optimal Alpha

Assess Feature Importance as per the Optimal Elastic Net

Regarding the ElasticNet Coefficients for feature selection:

This demonstrates ElasticNet's ability to filter out irrelevant features while keeping key predictors: as alpha increases, coefficients shrink progressively. This helps identify which features remain robust under varying levels of regularization.

Looking at MSE versus the regularization parameter, the training and testing errors diverge as alpha increases. With low alpha, training error is low but test error is higher due to overfitting. The optimal alpha (~0.00256) balances bias and variance (test RMSE: 0.5511). With high alpha, both errors increase due to underfitting. The optimal model also improves feature selection: sparse coefficients simplify the model while retaining domain-relevant predictors. Furthermore, k-fold CV provides robust validation, reducing variability and ensuring consistent performance across different splits of the data. The important features are clear, such as Neighborhood_NridgHt, GarageCars and Log_TotalLivingArea.

5/ OLS with most significant features after sophisticated Elastic Net

Model 4 vs Model 5

| Metric/Criteria | Model 4 | Model 5 | Comparison and Comments |
|---|---|---|---|
| R-squared | 0.676 | 0.676 | No change in overall model performance. |
| Adjusted R-squared | 0.668 | 0.668 | Performance remains identical after ElasticNet pruning. |
| RMS of Residuals | 0.513 | 0.513 | Residual error is consistent between models. |
| AIC | 2112 | 2112 | No difference in model quality based on AIC. |
| Condition Number | Stable | Stable | Condition number remains controlled, multicollinearity resolved. |
| Coefficient Comparison | Stable coefficients | Simplified coefficients | Model 5 retains most significant coefficients via ElasticNet. |
| Skewness | -0.377 | -0.393 | Slight improvement in symmetry in Model 5. |
| Kurtosis | 3.744 | 3.762 | Residual distribution remains close to normal. |
| QQ Plot | Slight deviations | Aligned with line | QQ plot improves in Model 5, indicating better normality. |
| Histogram of Residuals | Slight skew | Centered and normal | Residual distribution improves with ElasticNet adjustments. |
| Residuals vs Predicted | Small spread | Robust, symmetric | Residuals improve, showing a more symmetric spread in Model 5. |

Model 5 simplifies the model further using ElasticNet pruning while retaining key domain-relevant features. Here are some highlights:

In a general context, the MSZoning and Street features are often overlooked by a great number of buyers. In the Ames context, the data show that for a great number of properties the sale price per total living area is driven by the other remaining key features, which are more meaningful and statistically significant, not by those two.

Overall Summary of OLS Regression Models

| Metric/Criteria | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Comments |
|---|---|---|---|---|---|---|
| R-squared | 0.597 | 0.684 | 0.684 | 0.676 | 0.676 | Model performance stabilizes at Model 4/5. |
| Adjusted R-squared | 0.586 | 0.675 | 0.675 | 0.668 | 0.668 | Improvements from Model 1 to Model 2 hold. |
| AIC | 2894 | 2085 | 2085 | 2112 | 2112 | AIC stabilizes despite feature simplification. |
| BIC | 3101 | 2283 | 2283 | 2289 | 2289 | No major penalty for model complexity. |
| RMS of Residuals | 0.635 | 0.507 | 0.507 | 0.513 | 0.513 | Residual error improves significantly early. |
| Condition Number | 3.77e+03 | 9.12e+15 | 3.87e+03 | 3.48e+03 | 3.48e+03 | Multicollinearity resolved in later models. |
| Influential Points (Count) | 101 | 51 | 60 | 60 | 60 | Reduction achieved in Model 2 remains stable. |
| Coefficient Comparison | Mixed significance | Reduced extremes | Stable | Simplified | Simplified | Simplified coefficients in Models 4 and 5 show robust significance. |
| Skewness | -0.701 | -0.377 | -0.393 | -0.377 | -0.393 | Generally decreasing in later models; a great change between Model 1 and the rest. |

Model progression from 1 to 5 shows a clear trade-off between model complexity, robustness, and performance. Model 2 delivers a key improvement by addressing influential points. Model 3 simplifies the model with no performance loss. Model 4 resolves multicollinearity effectively. Model 5 offers the optimal balance with ElasticNet pruning and domain-specific interpretability: it resolves multicollinearity, maintains a high R-squared, and minimizes influential points, so it is chosen for further analysis. Each transition demonstrates statistical adjustments that gradually improve the stability, accuracy and interpretability of the final model.

Save Data

Parametric Bootstrapping

Having retained the most efficient set of significant features from the five models, I will evaluate the uncertainty of the OLS regression coefficients, focusing on their sampling distributions and constructing confidence intervals (CIs). Parametric bootstrapping allows me to quantify uncertainty by assessing the variability of the parameter estimates (regression coefficients) and constructing confidence intervals from the empirical distribution of the bootstrapped coefficients. I can then assess normality with a Q-Q plot to check how closely the bootstrapped coefficients follow a normal distribution. By resampling the data, I gain more confidence that the OLS results are not overly sensitive to a specific sample.
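The procedure above can be sketched with NumPy on a toy regression (the intercept/slope values, noise scale and sample size below are invented stand-ins, not the notebook's fitted model):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy stand-in for the fitted OLS model: y = 1.5 + 2.0*x + Gaussian noise.
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.5 + 2.0 * x + rng.normal(scale=0.8, size=n)

# Fit OLS once and estimate the residual scale.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - X.shape[1]))

# Parametric bootstrap: simulate responses from the fitted model, refit each time.
B = 2000
boot = np.empty((B, X.shape[1]))
for b in range(B):
    y_star = X @ beta_hat + rng.normal(scale=sigma_hat, size=n)
    boot[b], *_ = np.linalg.lstsq(X, y_star, rcond=None)

lo, hi = np.percentile(boot[:, 1], [2.5, 97.5])  # percentile CI for the slope
print(f"slope 95% CI: [{lo:.3f}, {hi:.3f}]")
```

The histogram and Q-Q plot of `boot[:, 1]` are then what the normality checks below are applied to.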

Most parameters show a nearly normal distribution with only slight deviations, except MSSubclass, which was previously identified as insignificant, and Neighborhood_Blueste. Blueste may contribute outliers that diminish its explanatory power for the target.

Looking particularly at the parameter BedroomAbvGr, one of the key predictors of the target, its distribution is nearly normal like many others. Its 95% confidence interval runs from -0.2578 to -0.1581, meaning we can be 95% confident that the true coefficient lies in that range. In the QQ plot, the points align closely with the red line, indicating that the bootstrapped parameter values are consistent with a normal distribution. In the histogram, the distribution appears approximately normal, centered around -0.2060, with the red dashed lines marking the confidence interval. Most of the distribution lies within the confidence interval, indicating that this feature's estimates are relatively stable and consistent; in other words, it is not overly influenced by noise or random fluctuations. Therefore, this feature contributes a reliable and significant effect to the model's predictions, because the observed values align well with the model's expected behavior.

Bayesian MCMC

Now I will apply a Bayesian Monte Carlo sampling approach: probabilistic regression modeling using Bayesian inference to understand and quantify uncertainty in predictions and model parameters. It is particularly useful in housing price analysis, where data uncertainty, heterogeneity and complex relationships exist across the multiple features contributing to the target.

In the model definition, I will assign weakly informative priors to the model parameters. The linear model predicts the target as a combination of the predictors and their coefficients, and a Gaussian likelihood accounts for observation noise.

Regarding posterior sampling, I will use Markov Chain Monte Carlo (MCMC) methods to generate the posterior distribution of the model parameters, capturing uncertainty and variability in their estimates.
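The notebook's model is presumably fitted with a probabilistic programming library; as a self-contained illustration of the same ingredients — weakly informative priors, a Gaussian likelihood and Monte Carlo posterior sampling — here is a minimal random-walk Metropolis sketch on toy data (all values invented for demonstration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data standing in for the housing regression: y = 0.5 + 1.2*x + noise.
n = 150
x = rng.normal(size=n)
y = 0.5 + 1.2 * x + rng.normal(scale=0.5, size=n)

def log_post(theta):
    a, b, log_s = theta
    s = np.exp(log_s)
    # Weakly informative priors: N(0, 10^2) on a and b, N(0, 1) on log sigma.
    lp = -(a**2 + b**2) / (2 * 10**2) - log_s**2 / 2
    # Gaussian likelihood for the observation noise.
    ll = -n * np.log(s) - np.sum((y - a - b * x) ** 2) / (2 * s**2)
    return lp + ll

# Random-walk Metropolis sampling of the posterior.
draws = 6000
chain = np.empty((draws, 3))
theta = np.array([y.mean(), 0.0, 0.0])  # start near the data
lp = log_post(theta)
for i in range(draws):
    prop = theta + rng.normal(scale=0.05, size=3)
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:  # Metropolis accept/reject step
        theta, lp = prop, lp_prop
    chain[i] = theta

post = chain[2000:]  # discard burn-in
print("posterior means (a, b, log sigma):", post.mean(axis=0).round(2))
```

Production samplers (NUTS/HMC) replace the random walk with gradient-guided proposals, but the accept/reject logic and burn-in idea are the same.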

I will then diagnose further using these plots. Trace plots visualize MCMC convergence for the intercept, coefficients and sigma; stable, non-diverging traces indicate convergence. Forest plots summarize credible intervals (such as the 94% HDI) for the model coefficients, illustrating which predictors significantly impact the target, log sale price per total living area. Posterior Predictive Checks (PPC) compare observed values to simulated predictive samples to evaluate model fit. The Observed vs Predicted plot shows posterior predictive samples averaged to produce predictions. Finally, a histogram overlays the observed target variable and the predicted target for visual comparison.

Insights

1/ Posterior Predictive Checks:

The comparison of observed vs. predicted log prices shows a good alignment between the actual and predicted values, indicating that the model captures the underlying trends in housing prices effectively.

The posterior predictive checks further reassure the model's predictive accuracy, with the posterior predictive samples closely matching the observed data distribution.

2/ Feature Significance and Impact:

Some neighborhoods, such as IDOTRR, SWISU and OldTown, show significantly negative coefficients, suggesting lower housing prices by living area compared to the baseline neighborhood. These areas might represent less affluent or less desirable locations, impacting housing affordability positively for budget-conscious buyers.

On the other hand, neighborhoods like NridgHt, StoneBr and Somerset have positive coefficients, reflecting higher property values. These are likely more affluent or desirable areas with higher demand, impacting affordability negatively.

The number of bedrooms, bathrooms and garage capacity show varying impacts. BedroomAbvGr and KitchenAbvGr negatively affect log prices by living area, possibly reflecting diminishing returns for larger houses in certain contexts.

Log_LotArea and GarageCars positively influence prices, indicating that larger lots and additional parking spaces are desirable features driving up property values.

Categorical variables like HouseStyle_Single_Family_1 and BldgType_Single_Family_1 indicate preferences for single-family homes, positively contributing to prices, which might reflect consumer preferences in suburban or less dense areas.

3/ Statistical Insights:

Credible intervals for coefficients highlight the uncertainty in parameter estimates. For instance, the wide intervals for some neighborhoods, for example, Blueste, indicate variability in housing prices within these areas, possibly due to heterogeneous housing stock or other unobserved factors.

The ranking by absolute mean helps identify the most influential features, with neighborhoods dominating the list. This underscores the importance of location as a primary determinant of housing affordability.

4/ Bayesian Modeling:

By including neighborhood-level intercepts, the Bayesian model accounts for group-level variability. This approach captures both local effects (neighborhood-specific trends) and global patterns (shared effects across all data), providing a nuanced understanding of housing price determinants.

Implications for Housing Policy

Targeting neighborhoods with lower intercepts for affordable housing development can aid policymakers in balancing housing supply. Neighborhoods with negative coefficients, for example, IDOTRR, SWISU, offer more affordable housing options, which could be beneficial for lower-income buyers. However, these areas may require additional investments in infrastructure or amenities to improve livability. The high variability in some neighborhoods highlights the potential for finding affordable housing options even within relatively expensive areas.

Features like Log_LotArea and GarageCars are associated with affordability. Development strategies can emphasize these features to meet housing demands within budget constraints.

Investing in neighborhoods with negative coefficients can improve affordability and living conditions, potentially driving demand and equalizing housing disparities. Encouraging diverse housing styles and types, such as single-family homes, could meet consumer preferences while balancing supply-demand dynamics.

In summary, by integrating numerical and categorical features with a hierarchical structure, the Bayesian model demonstrates a powerful framework for understanding housing affordability and informing data-driven policy decisions.

Visual Analytics

As per the Trace plot, the coefficients, intercept and sigma show well-mixed MCMC chains, indicating convergence and reliable posterior sampling.

As per the Forest plot, it illustrates credible intervals (HDI) for coefficients and intercept. Predictors with intervals excluding zero have significant influence on housing prices.

As per the Posterior Predictive Check, the observed target values along the black line align well with posterior predictive samples, suggesting the model captures the overall distribution of Log_Price_Per_TotalLivingArea.

In the Observed vs Predicted histogram, the predicted log prices align closely with the observed values, with only slight under- and over-prediction in some regions, which is normal.

Statistical Inference and Limitations

The close match between observed and predicted values demonstrates a good model fit, but outliers or unexplained variance could suggest additional features to consider, for example, distance to nearby school, train station, airport or supermarket, crime rates, etc.

The hierarchical model's ability to separate global and local effects ensures robust inferences. However, overlapping credible intervals for some features indicate areas where the model's predictive certainty is lower.

Hierarchical Bayesian Models

Pros and Cons

Key Metrics for Comparison:

Posterior Predictive Checks (PPC) evaluate model fit by comparing predicted vs observed data distributions.

Bayesian p-values (BPV) measure the adequacy of model predictions against observed data.

Leave-One-Out cross-validation (LOO) compares model performance using predictive accuracy.

Hierarchical Model

Insights

The model accommodates neighborhood-specific intercepts (a) and group-level regression coefficients (beta), capturing variations in housing affordability across neighborhoods.
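The stabilizing effect of neighborhood-specific intercepts comes from partial pooling: each neighborhood mean is shrunk toward the grand mean in proportion to how little data it has. A minimal empirical-Bayes-style sketch of that mechanism (group names, sizes and variances are invented, not the fitted model):

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulated log-price data for three neighborhoods of very different sizes.
true_means = {"OldTown": -0.4, "NridgHt": 0.6, "Blueste": 0.1}
sizes = {"OldTown": 80, "NridgHt": 60, "Blueste": 4}   # Blueste is tiny
data = {g: rng.normal(true_means[g], 0.5, size=sizes[g]) for g in true_means}

all_obs = np.concatenate(list(data.values()))
mu = all_obs.mean()        # grand mean across neighborhoods
tau2, s2 = 0.2**2, 0.5**2  # assumed group-level and observation variances

for g, obs in data.items():
    n = len(obs)
    w = (n / s2) / (n / s2 + 1 / tau2)  # precision-weighted shrinkage factor
    pooled = w * obs.mean() + (1 - w) * mu
    print(f"{g:8s} raw mean {obs.mean():+.3f} -> partially pooled {pooled:+.3f} (w={w:.2f})")
```

The tiny group (Blueste) is pulled strongly toward the grand mean, which is exactly why hierarchical intercepts are more stable than fully unpooled ones for underrepresented neighborhoods.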

Negative Intercepts: Neighborhoods such as OldTown, Edwards, and IDOTRR have highly negative intercepts, indicating lower baseline housing prices compared to others. These neighborhoods may be considered more affordable, possibly due to factors like older housing stock, less desirable location, or lower demand.

Less Negative/Positive Intercepts: Neighborhoods like NoRidge, NridgHt, and StoneBr have less negative or slightly positive intercepts, indicating higher baseline prices. These are likely more affluent or desirable areas, making them less affordable for buyers.

The credible intervals (HDI) for intercepts provide uncertainty bounds for neighborhood-specific effects, highlighting variability in affordability. For example, a[BrDale] has a wider HDI compared to a[NoRidge], reflecting greater variability in housing prices.
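An HDI is the narrowest interval containing a given posterior mass, which is what makes wide intervals like a[BrDale]'s informative about uncertainty. A minimal NumPy sketch of computing one from posterior samples (ArviZ's `az.hdi` would normally be used; the sample distribution below is a stand-in, not the fitted posterior):

```python
import numpy as np

def hdi(samples, mass=0.94):
    """Narrowest interval containing `mass` of the sampled posterior."""
    s = np.sort(samples)
    k = int(np.floor(mass * len(s)))     # interval width in number of samples
    widths = s[k:] - s[: len(s) - k]     # all candidate intervals of that mass
    i = np.argmin(widths)                # pick the narrowest one
    return s[i], s[i + k]

rng = np.random.default_rng(5)
draws = rng.normal(loc=0.17, scale=0.02, size=20000)  # stand-in posterior draws
lo, hi = hdi(draws)
print(f"94% HDI: [{lo:.3f}, {hi:.3f}]")
```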

The coefficients for numerical features quantify their impact on housing prices per unit increase while holding other factors constant.

GarageCars (+0.169): A positive coefficient indicates that adding a garage space significantly increases the housing price, reflecting its importance as a desirable feature.

TotalBath (+0.147): More bathrooms are associated with higher housing prices, consistent with buyer preferences for convenience.

BedroomAbvGr (-0.165): A negative coefficient suggests that additional bedrooms may reduce price-per-unit living area, likely due to diminishing returns in larger homes.

Log_LotArea (+0.070): Larger lot sizes positively impact price, but the effect is modest.

Features with credible intervals (HDI) excluding 0, for example, GarageCars, TotalBath are statistically significant predictors, while others, for example, Log_LotArea have weaker or less certain effects.

Categorical features, such as zoning, building type, provide additional context for housing affordability.

MSZoning_RL, MSZoning_RM, MSZoning_RH: These zoning categories have large positive coefficients, indicating significantly higher housing prices. This likely reflects zoning regulations favoring larger or higher-value properties.

TotRmsAbvGrd (-0.276): A negative coefficient for total rooms above ground suggests diminishing returns for larger homes in terms of price per unit area.

HouseStyle_Single_Family (+0.154): This feature indicates a preference for single-family homes, contributing positively to prices.

Credible intervals for some categorical features, such as Street_Pave, are wide, suggesting variability or weaker evidence for their effect.

Large variance in neighborhood intercepts (a), reflecting significant heterogeneity in neighborhood-level housing prices.

Consistent coefficients for individual features like BedroomAbvGr, GarageCars and TotalBath, showing their uniform influence on affordability.

Superior PPCs and BPV suggest it balances complexity and predictive accuracy better than the other models.

Strong LOO performance (best predictive ability).

Affordability Hotspots: Neighborhoods like Edwards, OldTown, and IDOTRR are more affordable but may have lower desirability or older housing stock.

Premium Areas: Neighborhoods like StoneBr and NridgHt are less affordable due to higher baseline prices, likely driven by better amenities, location, or newer housing stock.

Feature Contributions: Features like garage spaces and bathrooms positively influence prices, suggesting a preference for functionality.

The hierarchical model captures housing price disparities effectively across neighborhoods, indicating that regional differences heavily influence affordability. It supports localized policy-making for affordability improvement.

Pooled Model

Insights

Single Shared Coefficients: All neighborhoods share a common intercept and coefficients.

BedroomAbvGr: The coefficient is negative (-0.244) with a 94% HDI that excludes zero (-0.293, -0.191). More bedrooms above grade are associated with lower log price per total living area. This could indicate that houses with excessive bedrooms relative to total living area are less efficient in utilizing space, impacting affordability.

KitchenAbvGr: The negative coefficient (-0.205) also excludes zero (-0.250, -0.158). Additional kitchens above grade decrease housing price efficiency, possibly due to inefficient layouts or overutilization of space for non-essential rooms.

GarageCars: Positive and significant (0.326) with a narrow HDI (0.285, 0.368). More garage space is a positive indicator for higher housing prices, reflecting preferences for parking or additional storage.

TotalBath: Positive coefficient (0.219) is significant (HDI: 0.176, 0.268). More bathrooms increase the value of a property, highlighting their importance in housing desirability.

TotRmsAbvGrd: Negative and significant (-0.209, HDI: -0.264, -0.149). More total rooms above grade may reduce affordability, perhaps due to inefficient use of space or larger, less affordable homes.

HouseStyle_Single_Family: Positive and significant (0.259, HDI: 0.166, 0.356). Single-family homes tend to have higher prices, aligning with higher desirability but reduced affordability.

The pooled model assumes all neighborhoods share the same coefficients for features, pooling data across the entire dataset. This simplifies interpretation but risks oversimplifying neighborhood-specific trends. Significant numerical predictors like GarageCars and TotalBath emphasize the importance of functional amenities in housing value.

Affordability Concerns:

Negative coefficients for features like BedroomAbvGr and KitchenAbvGr suggest that inefficiently designed homes with excessive non-essential features can adversely affect price efficiency, impacting affordability. Positive effects from features like GarageCars and TotalBath reflect market preferences, potentially pushing prices higher.

The pooled model does not capture neighborhood-level variations explicitly, possibly masking localized trends. While significant, coefficients for zoning and amenities may differ in hierarchical or unpooled models, which consider neighborhood-specific variations.

Posterior distributions for all coefficients are tight due to pooling, but the model ignores neighborhood-specific variability.

Underestimates variance in affordability factors.

Worst PPC and BPV performance, indicating poor fit to the data.

Significantly worse LOO score, highlighting limited predictive power.

The pooled model oversimplifies housing affordability, failing to capture regional disparities. It is unsuitable for nuanced inference.

Unpooled Model

Insights

The unpooled model estimates a separate coefficient for each neighborhood or categorical group without sharing information across groups. This approach assumes no hierarchical structure, allowing coefficients to vary independently for neighborhoods and other categorical features. The goal is to understand the specific effects of each neighborhood and feature on housing prices.

Each neighborhood has its own regression model without shared information.
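Operationally, the unpooled setup amounts to fitting an independent regression per neighborhood with no information shared between fits. A toy NumPy sketch (neighborhood effects, slope and sample sizes invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(9)

# Each neighborhood gets its own independent intercept-plus-slope fit.
neigh_effects = {"IDOTRR": -1.1, "OldTown": -1.0, "NridgHt": 0.7}
fits = {}
for g, effect in neigh_effects.items():
    n = 40
    x = rng.normal(size=n)  # e.g. a standardized predictor like GarageCars
    y = effect + 0.16 * x + rng.normal(scale=0.3, size=n)
    X = np.column_stack([np.ones(n), x])
    fits[g], *_ = np.linalg.lstsq(X, y, rcond=None)

for g, (b0, b1) in fits.items():
    print(f"{g:8s} intercept {b0:+.2f}, slope {b1:+.2f}")
```

With only a few observations in a group (as with Blueste in the real data), such an independent fit becomes unstable, which is the instability the discussion below refers to.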

BedroomAbvGr (-0.165) and KitchenAbvGr (-0.163) have negative coefficients, indicating that an increase in these variables leads to a slight decrease in the log price per total living area. This may imply diminishing returns for additional bedrooms or kitchens in terms of affordability.

GarageCars (0.164), TotalBath (0.147) have positive coefficients, showing that these features add value to housing prices. For instance, additional garage capacity and bathrooms significantly improve housing value.

Neighborhood Effects: The coefficients for neighborhoods show varying effects on housing prices:

Negative Coefficients: Neighborhoods such as IDOTRR (-1.155), OldTown (-1.057), and SWISU (-0.854) negatively impact housing prices. These areas might have less desirable amenities, infrastructure, or other factors making housing more affordable.

Positive Coefficients: Neighborhoods such as NridgHt (0.685), StoneBr (0.614), and Somerset (0.392) positively impact housing prices, indicating higher demand and possibly better amenities.

Wide Credible Intervals: Some neighborhoods, such as Blueste, have wide posterior distributions, suggesting high variability or uncertainty in their price effects.

Neighborhood Affordability: Neighborhoods with highly negative coefficients, for example, IDOTRR, OldTown are more affordable due to lower prices, but these areas might lack premium amenities or infrastructure.

Premium Areas: Positive coefficients for neighborhoods such as NridgHt and StoneBr highlight areas with higher housing prices, likely driven by better infrastructure, schools, or amenities.

Feature Importance: The importance of features like GarageCars, TotalBath, and Log_LotArea indicates that functional and practical aspects of housing significantly influence prices.

Credible Intervals: The 94% HDI intervals provide a robust measure of uncertainty. For instance, while NridgHt is associated with a positive effect, the narrow credible interval suggests high confidence in its premium value.

Similar patterns for feature coefficients (Beta1), but the lack of shared information reduces robustness in underrepresented neighborhoods.

Posterior distributions for intercepts (Beta0) are broader than the hierarchical model.

Performs comparably to the hierarchical model in PPC and BPV but shows slightly inferior LOO scores.

The unpooled model emphasizes neighborhood-specific affordability without sharing statistical strength, which may lead to overfitting or instability in low-data neighborhoods.

Posterior Predictive Checks and Model Comparison

As per the Posterior Predictive Checks, the Hierarchical Model's predictive distributions align closely with the observed data, suggesting that it captures variability effectively and provides a robust fit. The Pooled Model's predictive distributions appear less flexible, indicating limited variation and poorer alignment with the observed data, especially in the tails of the distribution. The Unpooled Model's predictive distributions show moderate alignment with the observed data, though not as good as the hierarchical model's, indicating slight overfitting to individual observations or groups.

Regarding the model comparison using LOO (Leave-One-Out Cross-Validation), the Hierarchical Model achieves the highest ELPD (expected log pointwise predictive density), suggesting it has the best predictive performance among the three models; it effectively balances flexibility and generalizability with a group-level structure. The Unpooled Model has a slightly worse ELPD than the hierarchical model, indicating that it captures less shared structure among groups and is less generalizable. The Pooled Model performs significantly worse, with a large ELPD difference, as it fails to account for group-level variation.

In summary, I find that the hierarchical model is superior in balancing group-level structure and individual variability, making it the best choice for predictive performance. Unfortunately, the pooled model oversimplifies the problem by assuming homogeneity across groups, which causes poor predictive performance. In contrast, the unpooled model offers some improvements over the pooled model but lacks the shared structure captured by the hierarchical model. Therefore, the hierarchical model outperforms both pooled and unpooled models in predictive performance and flexibility. It is the most robust choice when group-level structure is important.

MAP Estimates and Parameter Distributions

As per the MAP Estimates (Maximum a Posteriori):

Regarding the Parameter Distributions:

In summary, the hierarchical model provides a balance between group-level and individual-level effects, regularizing coefficients, reducing overfitting and improving robustness. The unpooled model captures more individual variability but risks overfitting due to the lack of shared structure among groups. The regularization effect of the hierarchical model leads to more reliable estimates, as reflected in the narrower parameter distributions, and making it a better choice than the unpooled model for scenarios requiring balanced group-level and individual-level insights.

Final Summary

My project analyzes housing affordability using Bayesian hierarchical and unpooled models. Key features were included: neighborhood attributes, numerical housing characteristics (such as bedrooms and lot sizes) and categorical factors (such as building type and house style).

Data Preparation:

I processed and standardized numerical features as well as converted categorical variables into one-hot or binary numerical formats. I also removed missing values to ensure data integrity.

OLS Regression:

I built an OLS model as a baseline for analyzing housing price determinants. Key findings: GarageCars and TotalBath were significant positive predictors of housing prices, and neighborhoods and categorical features such as BldgType and HouseStyle had a meaningful impact on affordability. A limitation of OLS is its assumption of homoscedastic, independent errors, which may not capture hierarchical dependencies such as neighborhoods.

Bayesian Modeling:

I developed Hierarchical Models with group-level effects (neighborhoods), using Laplace priors on intercepts and coefficients for interpretability. I also built an Unpooled Model, which estimates parameters independently for each neighborhood, and a Pooled Model, which estimates a single set of parameters regardless of neighborhood, and compared the results using trace plots, posterior summaries and posterior predictive checks.

Correlation Analysis:

I examined relationships between key numerical features using Spearman and Pearson correlation matrices to identify significant predictors and to flag multicollinear features for removal, reducing model complexity.
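Both matrices are one call each in pandas; the small frame below is illustrative, with column names following the project's features.

```python
import pandas as pd

# Illustrative numerical features (names follow the project's data).
df = pd.DataFrame({
    "GarageCars": [1, 2, 2, 3, 3, 4],
    "TotalBath":  [1, 1, 2, 2, 3, 3],
    "LotArea":    [5000, 7000, 6500, 9000, 8500, 12000],
})

pearson = df.corr(method="pearson")    # linear association
spearman = df.corr(method="spearman")  # monotonic (rank-based) association

# Highly correlated pairs (e.g., |r| > 0.8) are candidates to drop
# to reduce multicollinearity before regression.
print(pearson.round(2))
print(spearman.round(2))
```

Spearman is the safer default here because sale prices and lot sizes are skewed, and rank correlation is robust to that skew.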

Insights:

Hierarchical models revealed varying neighborhood-level affordability trends. Unpooled models highlighted independent effects of specific features, like GarageCars and TotalBath, on housing affordability. Categorical variables such as BldgType and HouseStyle demonstrated significant predictive power.

Statistical Inference:

Posterior distributions provided credible intervals for coefficients, aiding decision-making. Insights support identifying key drivers of affordability and potential areas for policy intervention.

Impact:

My analysis offers a robust framework for housing market insights, combining OLS regression for interpretability with Bayesian techniques for flexibility and precision. It supports policymakers and stakeholders in understanding affordability drivers at both micro and macro levels. As new data arrive, the posteriors from the current Bayesian models can serve as priors for models refitted on the combined data, updating the insights without discarding what was learned previously.
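The posterior-as-prior idea can be checked exactly in a conjugate toy case: updating a normal mean sequentially (old data, then new data using the old posterior as the prior) gives the same posterior as one batch fit. The helper and the simulated data are illustrative, not part of the project's models.

```python
import numpy as np

def normal_update(prior_mean, prior_var, data, noise_var):
    """Conjugate posterior for a normal mean with known noise variance."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + np.sum(data) / noise_var)
    return post_mean, post_var

rng = np.random.default_rng(1)
old_data = rng.normal(0.5, 1.0, 100)
new_data = rng.normal(0.5, 1.0, 50)

# Fit on old data, then reuse that posterior as the prior for new data.
m1, v1 = normal_update(0.0, 10.0, old_data, 1.0)
m2, v2 = normal_update(m1, v1, new_data, 1.0)

# Equivalent to fitting once on all data with the original prior.
mb, vb = normal_update(0.0, 10.0, np.concatenate([old_data, new_data]), 1.0)
print(round(m2, 6), round(mb, 6))
```

For the non-conjugate PyMC models in the project, the same principle applies approximately by summarizing the old posterior (e.g., as a fitted normal) and using that summary as the new prior.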

Further Improvement:

A further Bayesian analysis could apply time-series data to model the switchpoint of the economic crisis (December 2007 to June 2009) and its impact on the target and on housing affordability, tracking how an event or policy shifts the trend through causal inference. This modelling would require more sophisticated techniques for specifying the priors, the switchpoint and the resulting posteriors. Time-series data often exhibit structural breaks or regime changes, such as sudden shifts in trend, which standard Bayesian models may fail to adapt to. Choosing appropriate priors is also critical but non-trivial, especially for high-dimensional time-series datasets, because poorly chosen priors can lead to biased or unstable posterior estimates. For these reasons, I have not extended the current project to this analysis.
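To make the switchpoint idea concrete, here is a minimal sketch of it on synthetic count data: a discrete uniform prior over the switchpoint with conjugate Gamma priors on the two regime rates, so each candidate split's posterior weight is the product of the two segments' closed-form marginal likelihoods. This is a simplified stand-in for the full time-series model described above; the data, the Poisson likelihood and the change at t = 60 are all invented for illustration.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal(y, a=1.0, b=1.0):
    """Log marginal likelihood of Poisson counts under a Gamma(a, b) rate prior."""
    n, s = len(y), np.sum(y)
    return (a * np.log(b) - gammaln(a)
            + gammaln(a + s) - (a + s) * np.log(b + n)
            - np.sum(gammaln(y + 1)))

rng = np.random.default_rng(0)
# Synthetic monthly counts with a regime change at t = 60,
# standing in for a crisis-driven shift in a housing series.
y = np.concatenate([rng.poisson(2.0, 60), rng.poisson(10.0, 40)])

# Discrete uniform prior over the switchpoint: the posterior at each
# candidate split is proportional to the two segments' marginal likelihoods.
log_post = np.array([log_marginal(y[:s]) + log_marginal(y[s:])
                     for s in range(1, len(y))])
switch = 1 + int(np.argmax(log_post))
print("estimated switchpoint:", switch)
```

A full PyMC treatment would sample the switchpoint and rates jointly (as in the classic coal-mining-disasters example) and extend naturally to trends and covariates.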

References

Diagnosing Biased Inference with Divergences

PyMC Team. "Diagnosing Biased Inference with Divergences." PyMC Examples: Diagnostics and Criticism. Available at: https://www.pymc.io/projects/examples/en/latest/diagnostics_and_criticism/Diagnosing_biased_Inference_with_Divergences.html

Hierarchical Modeling with Variational Inference

PyMC Team. "Hierarchical GLM with ADVI and Minibatch." PyMC Examples: Variational Inference. Available at: https://www.pymc.io/projects/examples/en/latest/variational_inference/GLM-hierarchical-advi-minibatch.html

Rugby Analytics with Bayesian Hierarchical Models

PyMC Team. "Bayesian Hierarchical Models in Rugby Analytics." PyMC Examples: Case Studies. Available at: https://www.pymc.io/projects/examples/en/latest/case_studies/rugby_analytics.html

Hierarchical Binomial Model with PyMC

PyMC Team. "GLM: Hierarchical Binomial Model." PyMC Examples: Generalized Linear Models. Available at: https://www.pymc.io/projects/examples/en/latest/generalized_linear_models/GLM-hierarchical-binomial-model.html

Hierarchical Partial Pooling

PyMC Team. "Hierarchical Partial Pooling: A Case Study." PyMC Examples: Case Studies. Available at: https://www.pymc.io/projects/examples/en/latest/case_studies/hierarchical_partial_pooling.html

Bambi: A High-Level Bayesian Modeling Interface

Bambi Developers. Bambi: Bayesian Models Made Easy. Available at: https://bambinos.github.io/bambi/